Add ability to process local docs #318
Pull Request Overview
This PR adds functionality to process local documents in addition to web-based URLs. The implementation introduces a new load_local_docs utility and corresponding file loader services for HTML and PDF files, enabling users to provide local file paths alongside URL-based sources.
Key Changes
- Added `AsyncLocalFileLoader` support for processing local PDF and HTML files
- Introduced `known_local_docs` parameter to allow users to specify local document paths
- Modified processing pipeline to check local documents before URL-based searches
- Updated class references from `AsyncFileLoader` to `AsyncWebFileLoader` for clarity
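The local-before-URL ordering described in the list above can be sketched roughly as follows. This is only an illustration, not the PR's actual code: `pick_loader`, `plan_sources`, and `LOADER_BY_SUFFIX` are hypothetical names, and treating `.htm` as HTML is an assumption (the PR only states that PDF and HTML files are supported).

```python
from pathlib import Path

# Hypothetical suffix-to-loader mapping; the ".htm" entry is an assumption.
LOADER_BY_SUFFIX = {".pdf": "pdf", ".html": "html", ".htm": "html"}


def pick_loader(path):
    """Return the loader kind for a local file, or None if unsupported."""
    return LOADER_BY_SUFFIX.get(Path(path).suffix.lower())


def plan_sources(known_local_docs=None, known_urls=None):
    """Order sources so local documents are checked before URLs."""
    plan = []
    for path in known_local_docs or []:
        kind = pick_loader(path)
        if kind is not None:  # silently skip unsupported file types
            plan.append(("local", kind, str(path)))
    for url in known_urls or []:
        plan.append(("url", "web", url))
    return plan
```

In this sketch, unsupported local files are skipped rather than raising an error; whether the real pipeline does the same is not stated in the PR summary.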
Reviewed Changes
Copilot reviewed 15 out of 16 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| compass/utilities/io.py | New utility module for loading local documents |
| compass/services/cpu.py | Added functions to read local PDF files (with and without OCR) |
| compass/services/threaded.py | Added HTMLFileLoader service and functions to read local HTML files |
| compass/scripts/process.py | Updated main processing function to support local documents and modified processing order |
| compass/scripts/download.py | Added load_known_docs function for loading local documents |
| compass/utilities/nt.py | Added known_local_docs field to ProcessKwargs namedtuple |
| compass/utilities/__init__.py | Exported new load_local_docs function |
| compass/validation/location.py | Updated class reference from AsyncFileLoader to AsyncWebFileLoader |
| compass/validation/content.py | Made legal_text_validator parameter optional in parse_by_chunks |
| compass/extraction/apply.py | Added logic to skip legal text and date validation for known documents |
| compass/web/website_crawl.py | Updated class references and fixed documentation typo |
| tests/python/unit/utilities/test_utilities_io.py | New test file for local document loading functionality |
| tests/python/unit/utilities/test_utilities_base.py | Minor docstring correction |
| tests/python/unit/validation/test_validation_location.py | Added missing @pytest.mark.asyncio decorator |
| tests/python/integration/test_integrated.py | Updated class references from AsyncFileLoader to AsyncWebFileLoader |
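The table notes that the `legal_text_validator` parameter became optional so that known documents can skip legal text validation. A minimal sketch of that optional-validator pattern is shown below. It is an assumption-laden illustration only: `keep_chunks` is a hypothetical helper, and the real `parse_by_chunks` signature and behavior are not shown in this summary.

```python
def keep_chunks(chunks, legal_text_validator=None):
    """Return chunks that pass validation.

    When no validator is given (as assumed here for known local
    documents), every chunk is kept without a legal-text check.
    """
    kept = []
    for chunk in chunks:
        # Only run the legal-text check if a validator was supplied.
        if legal_text_validator is not None and not legal_text_validator(chunk):
            continue
        kept.append(chunk)
    return kept
```

Defaulting the validator to `None` lets existing callers keep passing one while callers handling user-supplied documents simply omit it.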
Add an option to process local documents, similar to how we allow processing known URLs.